Data Science on Blockchain, how to get started?

Two analysis examples on NFT and Helium

Thomas de Marchin

14DEC2022

Introduction

  • Associate Director, Statistics & Data Science at Pharmalex: support pharmaceutical companies in drug development
  • Gravitated towards blockchain technology and the place of data science within it
  • Lots of data available on the blockchain (~ 1x10^6 transaction per day on ethereum), how to access it?
  • Obtaining cryptocurrencies price data is straightforward, more sophisticated data manipulations require access to the “source” data
  • Github: https://github.com/tdemarchin
  • Blog: https://tdemarchin.medium.com

Blockchain is hard to read

  1. Several Tb of data. Ethereum ~ 6x10^9 transactions.

  2. Data are stored sequentially, requires developing specific tools to follow a transaction.

  3. The structure of a transaction is difficult to read

  4. Fragmentation of blockchain technologies

A huge hard drive

Image from ig.com

Structure of a transaction

How to get the data ?

  1. Set-up an ETL: Extract, Transform and Load. Most flexible but need to set up a server with huge and fast hard-drives.
  2. Use data providers:
  • Dashboards: Dune analytics, Nansen, GraphSense, icy.tools,…
  • APIs (software intermediary that allows two applications to talk to each other): OpenSea, Etherscan, Infura, The Graph,…

Two examples implemented in R:

  1. How to track and visualize NFT transactions on Ethereum
  2. Blockchain IoT data visualization and the rise of the Helium network

NFTs

https://www.larvalabs.com/cryptopunks

  • Non-Fungible Tokens: represent ownership of unique items (art, collectibles, patents, real estate,…)
  • Smart contracts: decentralized programs stored on a blockchain that run when predetermined conditions are met
  • What can we do with NFT-related data ?

NFTs price analysis with OpenSea API

OpenSea: big NFT market place

resOpenSea <- GET("https://api.opensea.io/api/v1/events",
          query = list(limit = 300, 
                       event_type = "successful", 
                       only_opensea = "true")) 

  • 300 transactions max, limited to OpenSea transactions
  • Pre-processed by OpenSea and not raw blockchain transactions
  • Mainly price analysis
  • Analysis done in June 2021: Link

NFTs tracking with EtherScan API: the data

https://weirdwhalesnft.com/

  • Weird Whales: collection of 3350 whales programmatically generated, each with their unique characteristics and traits. Created by a 12-year-old programmer named Benyamin Ahmed who made the buzz.
  • EtherScan: block explorer to view information about transactions, verify contract code… \(\rightarrow\) access to more raw data
  • Analysis done in October 2021 and updated in December 2022: Link
  • Weird Whales are managed by a specific smart contract on the Ethereum blockchain. You can find it on https://etherscan.io/.

  • To make it easier to extract information from the blockchain, we can read the events: dispatched signals (easy to read) the smart contracts can fire.

resEventTransfer <- GET("https://api.etherscan.io/api",
                          query = list(module = "logs", 
                                       action = "getLogs", 
                                       fromBlock = fromBlock, 
                                       toBlock = "latest",
                                       address = "0x96ed81c7f4406eff359e27bff6325dc3c9e042bd", 
                                       topic0 = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef",
                                       apikey = EtherScanAPIToken)) 

Where is the sales price? On OpenSea, sales are managed by the main contract and if approved, the second contract is called (here Weird Whales), which then triggers the transfer \(\rightarrow\) need to download all the transactions from the OpenSea main smart contract address and then filter for the ones related to Weird Whales (~ 10000 API calls, can take several hours…).

  • Ethereum / USD rate is highly volatile, need to obtain the historical ETH market price if we want to convert ETH to USD.
  • A spline is fitted on data from the CoinGecko API

NFTs tracking with EtherScan API: analysis and visualisation

  • High variability in prices at the begining (buzz), followed by a quieter period
  • Some NFTs are exchanged outside the OpenSea Market (price = 0) \(\rightarrow\) difficult to trace….
  • Starting March 2022, all sales price felt to 0 \(\rightarrow\) OpenSea was hacked in March, they probably updated their smart contract…
  • Perfect data to be plotted as a network

  • Networks are described by:

    • Vertices (or nodes): the wallet addresses
    • Edges (or links): the transactions
  • network package:

# create the network using network function
network <- network(edges,  
                   vertex.attr = vertices)

  • Each color represents a unique token ID (we restricted to tokens involved in at least 10 transactions to improve visibility)
  • All transactions of all the tokens originate from the single minting address (1)
  • Some addresses are involved in multiple transactions (curved edges)
  • Some tokens were transferred to an address only to be sent back to the sender…?
  • Made with the ggraph packages

About 2/3 of the transactions happened very shortly after the NFT’s creation

  • Made with the network and networkDynamic packages

Helium IoT network

 

Helium, the people’s network. Photo from Nima Mot.

 

A miner

 

Rewards

 

  • Helium: decentralized wireless infrastructure for IoT devices (environmental sensors, localisation sensors to track bike fleets,…), relies on a blockchain
  • People are incentivized to install hotspots and become a part of the network by earning Helium tokens
  • Questions:
    • How big is the Helium network infrastructure, where are located the hotspots?
    • Are they actively utilized, i.e. are they used to transfer data with connected devices?
  • Need to download the full blockchain \(\rightarrow\) Dewi ETL project is a dedicated ETL server \(\rightarrow\) CSV extracts in 10k/50k-block increments on a web server: centralized, bad practice!
  • Analysis done in March 2022: Link

Helium IoT network: infrastructure

Liege

  • What is the size of the infrastructure? \(\rightarrow\) hotspot data
  • H3: geospatial indexing system using a hierarchical hexagonal grid. At resolution 8: earth is covered by 691,776,122 hexagons \(\rightarrow\) we convert to lat/lng

  • Some owners own more than 50 hotspots!
  • Three phases: (1) a slow linear increase, (2) an exponential increase in the middle of 2021 followed by (3) a fast linear increase (did it continue?)

Cumulative growth of the network infrastructure in terms of number of hotspots added

  • We start by creating an empty world map on which we overlay the hotspot data.
  • Plotting all the individual hotspots on a map would be too much (> 500k hotspots) \(\rightarrow\) we cluster the hotspots into hexagons (hexbin and geom_hex)
  • Most hotspots are located in North America, Europe and Asia, mostly in big cities. Practically no hotspots in Africa, Russia and very few in South America. A few hotspots in the middle of the ocean… Data issue or cheating to increase rewards?

Helium IoT network: network usage

  • We understood that hotspots are distributed on the planet and provide good coverage in big cities. Are they being actively used by connected devices and how often? \(\rightarrow\) Transaction data
  • We can look at network usage from two perspectives: (1) check the volume of data exchanged and (2) check how often the hotspots have been involved in data transfer with connected devices

Total volume of data exchanged

Cumulative sum of the number of transactions

  • 40.48% hotspots did not participate in any transaction so far
  • Total volume of data exchanged is probably not a good metric \(\rightarrow\) network is intented to transfer small data across long distance
  • Slow linear increase followed by an exponential increase, which is finally followed by a fast linear increase
  • Glitch in November 2021: major outage of the blockchain
  • Despite having about 15% of the hotspots, Asia don’t seem to be so active in terms of data transfer

Evolution of the number of transactions across the globe

  • Transactions mainly occur in North America before mid 2020, then followed by a strong wave in Europe and Asia. Barely no transaction have occurred in South America and Africa.
  • Made with the gganimate package

 

Distribution of the transactions in the US

 

Distribution of the transactions in Belgium

 

  • US: Transactions are homogeneously distributed across the country although the peaks of activity (note that the legend is logarithmic!) are located around big cities (New York, Los Angeles, San Francisco, Miami)
  • BE: Transactions are not homogeneously distributed across the country \(\rightarrow\) most transactions happen in the upper part of the country, which is consistent with the lower part of the country being scarcely populated (Ardennes)
  • Made with the rayshader package, interactive version available in the original article

Conclusion

  • Approaches for reading the blockchains and visualizing blockchain transactions
  • NFT: how to download and plot a network of transactions associated to a NFT collection \(\rightarrow\) interesting patterns in the evolution of NFT prices, which can be linked to social media attention
  • Helium: techniques on how to summarise and visualise the network growth in term of infrastructure and data usage \(\rightarrow\) the infrastructure is well developed, the network has a lot of spare capacity

\(\rightarrow\) obtaining data in an appropriate format requires a deep understanding of the blockchain

\(\rightarrow\) visualisation of blockchain data is still challenging because of the huge amount of data, need data scientists to develop blockchain specific approaches

Next step: analyse the networks using the graph theory

Follow me on Medium, Linkedin and/or Twitter